DBSCAN — Density-Based Clustering (Finding crowds instead of shapes)#

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that looks for crowds: regions where points are packed together, separated by regions that are sparse.

That density-first view gives you two superpowers:

  • arbitrary shapes (clusters don’t have to be “ball-shaped”)

  • noise detection (points that don’t belong anywhere can be labeled as noise)

The cost: DBSCAN is sensitive to a distance scale (eps) and can struggle when clusters have very different densities.

DBSCAN is also commonly used after dimensionality reduction (PCA/UMAP/t-SNE): first embed to a space where distance is meaningful, then run a density-based clusterer.
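As a minimal sketch of that pipeline (the dataset, component count, and parameter values here are illustrative assumptions, not recommendations):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# 3 well-separated blobs in 20 dimensions (stand-in for "high-dimensional data").
X_hi, _ = make_blobs(n_samples=300, n_features=20, centers=3, random_state=0)

# Step 1: embed to a low-dimensional space where Euclidean distance is meaningful.
X_emb = PCA(n_components=2, random_state=0).fit_transform(X_hi)

# Step 2: run the density-based clusterer in the embedded space.
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X_emb)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```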


Learning goals#

By the end you should be able to:

  • explain DBSCAN as “finding crowds instead of shapes”

  • define ε-neighborhoods and distinguish core / border / noise points

  • describe density reachability and the cluster expansion process

  • see (with Plotly) how eps and min_samples change the result

  • compare DBSCAN with k-means and HDBSCAN (and know when to use which)


Notation (quick)#

  • Dataset: X = {x1, …, xn} with xi ∈ R^d

  • Distance metric: dist(x, z) (Euclidean unless stated otherwise)

  • ε-neighborhood of point p:

    Nε(p) = { q ∈ X : dist(p, q) ≤ ε }

  • min_samples: minimum number of points in Nε(p) (usually counting p itself) for p to be a core point

  • DBSCAN labels:

    • cluster IDs: 0, 1, 2, ...

    • noise: -1 (in scikit-learn)
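A tiny hand-made example (the toy coordinates are chosen purely for this illustration) shows the label format:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight pairs plus one isolated point.
X_tiny = np.array([[0.0, 0.0], [0.1, 0.0],
                   [5.0, 5.0], [5.1, 5.0],
                   [20.0, 20.0]])

# min_samples=2 counts the point itself, so each member of a pair is core.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X_tiny)
print(labels)  # -> [ 0  0  1  1 -1]: two clusters, and the isolated point is noise
```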


Table of contents#

  1. Intuition: finding crowds instead of shapes

  2. Density concepts: ε-neighborhood, core/border/noise

  3. Algorithm steps: density reachability + cluster expansion

  4. DBSCAN from scratch (NumPy)

  5. Plotly lab: effect of eps and min_samples

  6. Plotly lab: noise detection + choosing eps

  7. Comparison: DBSCAN vs k-means vs HDBSCAN

  8. scikit-learn DBSCAN: practical parameter map

  9. Practical checklist + pitfalls

  10. Exercises

  11. References

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from plotly.colors import qualitative

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons, make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) Intuition: “finding crowds instead of shapes”#

Many clustering algorithms start by assuming a shape.

  • k-means assumes clusters look like spheres (in your feature space): “assign points to the closest center.”

  • DBSCAN assumes clusters look like crowds: “a cluster is where you can keep walking from point to nearby point without ever crossing a sparse desert.”

That’s why DBSCAN can discover non-convex clusters (like two interleaving moons) and also label outliers as noise.

# A classic non-convex dataset + some uniform noise
X_moons, _ = make_moons(n_samples=450, noise=0.06, random_state=42)

noise = rng.uniform(
    low=X_moons.min(axis=0) - 0.6,
    high=X_moons.max(axis=0) + 0.6,
    size=(60, 2),
)
X = np.vstack([X_moons, noise])

fig = px.scatter(x=X[:, 0], y=X[:, 1], title="Two moons + uniform noise (no clustering yet)")
fig.update_traces(marker=dict(size=6, opacity=0.85))
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()

2) Density concepts: ε-neighborhood, core/border/noise#

DBSCAN is built from one local question:

“Around this point, is it crowded?”

ε-neighborhood#

Pick a radius ε (epsilon). The ε-neighborhood of a point p is all points within distance ε.

Core / border / noise points#

Given ε and min_samples:

  • Core point: has at least min_samples points in its ε-neighborhood.

  • Border point: not core, but lies inside the ε-neighborhood of a core point.

  • Noise point: neither core nor border.

Let’s visualize a single ε-neighborhood as a circle.

def circle_xy(center_xy: np.ndarray, radius: float, n: int = 200) -> tuple[np.ndarray, np.ndarray]:
    t = np.linspace(0, 2 * np.pi, n)
    cx, cy = center_xy
    return cx + radius * np.cos(t), cy + radius * np.sin(t)


eps_demo = 0.25
p_idx = 25
p = X[p_idx]

dists = np.linalg.norm(X - p, axis=1)
nbr_mask = dists <= eps_demo

fig = go.Figure()

# All points (faded)
fig.add_trace(
    go.Scatter(
        x=X[:, 0],
        y=X[:, 1],
        mode="markers",
        name="all points",
        marker=dict(size=6, color="lightgray", opacity=0.5),
    )
)

# Neighbors within eps
fig.add_trace(
    go.Scatter(
        x=X[nbr_mask, 0],
        y=X[nbr_mask, 1],
        mode="markers",
        name=f"Nε(p) (ε={eps_demo})",
        marker=dict(size=7, color=qualitative.Set2[1], opacity=0.9),
    )
)

# The chosen point p
fig.add_trace(
    go.Scatter(
        x=[p[0]],
        y=[p[1]],
        mode="markers",
        name="p",
        marker=dict(size=11, color=qualitative.Set2[0], symbol="star"),
    )
)

# Circle
cx, cy = circle_xy(p, eps_demo)
fig.add_trace(go.Scatter(x=cx, y=cy, mode="lines", name="ε-ball", line=dict(color="black", width=1)))

fig.update_layout(title="An ε-neighborhood is a circle around p", legend=dict(orientation="h"))
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
def point_types(X: np.ndarray, eps: float, min_samples: int) -> np.ndarray:
    """Return an array of strings: 'core' / 'border' / 'noise' for each point."""
    X = np.asarray(X)
    eps2 = float(eps) ** 2

    sq = np.sum(X**2, axis=1, keepdims=True)
    sq_dists = sq + sq.T - 2 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)

    nbr_counts = np.sum(sq_dists <= eps2, axis=1)
    is_core = nbr_counts >= min_samples

    # border: non-core point that is in the eps-neighborhood of at least one core point
    is_border = (~is_core) & (np.any((sq_dists <= eps2) & is_core[None, :], axis=1))
    is_noise = (~is_core) & (~is_border)

    types = np.empty(X.shape[0], dtype=object)
    types[is_core] = "core"
    types[is_border] = "border"
    types[is_noise] = "noise"
    return types


eps_demo = 0.25
min_samples_demo = 6
types = point_types(X, eps=eps_demo, min_samples=min_samples_demo)

fig = px.scatter(
    x=X[:, 0],
    y=X[:, 1],
    color=types,
    category_orders={"color": ["core", "border", "noise"]},
    color_discrete_map={"core": qualitative.Set2[0], "border": qualitative.Set2[1], "noise": "lightgray"},
    title=f"Core / border / noise given ε={eps_demo}, min_samples={min_samples_demo}",
)
fig.update_traces(marker=dict(size=6, opacity=0.9))
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()

3) Algorithm steps: density reachability + cluster expansion#

DBSCAN turns local density into clusters using reachability.

Density reachability#

  • A point q is directly density-reachable from p if:

    1. q ∈ Nε(p), and

    2. p is a core point.

  • A point q is density-reachable from p if there exists a chain:

    p → p1 → p2 → ... → q

    where each step is directly density-reachable.

Intuition: you can “walk” from p to q while always stepping inside ε-neighborhoods of core points.

Cluster expansion (the algorithmic idea)#

  1. Pick an unvisited point.

  2. If it’s not core, it can’t start a cluster (it might later become a border point).

  3. If it is core, start a new cluster and grow it by repeatedly adding neighbors.

You can think of this as BFS/DFS over a graph:

  • vertices are points

  • an edge exists if two points are within ε

  • but only core points are allowed to “expand” the frontier
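That graph view can be made concrete with off-the-shelf tools. The sketch below (not how scikit-learn implements DBSCAN) only clusters the core points; a full DBSCAN would then attach each border point to the cluster of a nearby core point:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.datasets import make_moons
from sklearn.neighbors import radius_neighbors_graph

eps, min_samples = 0.25, 6
X_demo, _ = make_moons(n_samples=300, noise=0.06, random_state=0)

# ε-graph: an edge between any two points within distance eps (self-loops included).
G = radius_neighbors_graph(X_demo, radius=eps, include_self=True)

# Core points: at least min_samples points (counting the point itself) within eps.
is_core = np.asarray(G.sum(axis=1)).ravel() >= min_samples

# Clusters = connected components of the ε-graph restricted to core points.
core_idx = np.flatnonzero(is_core)
n_comp, comp = connected_components(G[core_idx][:, core_idx], directed=False)
print("clusters among core points:", n_comp)
```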

Pseudocode (classic DBSCAN)#

labels = UNASSIGNED
cluster_id = 0

for each point p:
  if p is visited: continue
  mark p visited
  neighbors = Nε(p)
  if |neighbors| < min_samples:
    label p as NOISE   # might be relabeled later
  else:
    start new cluster C = cluster_id
    expand(C, p, neighbors)
    cluster_id += 1

expand(C, p, neighbors):
  label p as C
  seed_set = neighbors
  while seed_set not empty:
    q = pop(seed_set)
    if q not visited:
      mark q visited
      q_neighbors = Nε(q)
      if |q_neighbors| >= min_samples:   # q is core
        seed_set = seed_set ∪ q_neighbors
    if q is UNASSIGNED or NOISE:
      label q as C

Two small details to notice:

  • Border points can be labeled as noise initially, then later pulled into a cluster.

  • Border points that touch multiple clusters can be assigned depending on traversal order.

4) DBSCAN from scratch (NumPy)#

For learning, we’ll implement the algorithm with a simple (dense) distance matrix. This is O(n²) memory/time, which is fine for small demos.

scikit-learn uses spatial indexing (KD-tree / ball tree) when possible for faster neighborhood queries.

def dbscan_from_scratch(X: np.ndarray, eps: float, min_samples: int) -> np.ndarray:
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    eps2 = float(eps) ** 2

    # Squared Euclidean distance matrix via (x-y)^2 = x^2 + y^2 - 2 x·y
    sq = np.sum(X**2, axis=1, keepdims=True)
    sq_dists = sq + sq.T - 2 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)

    neighbors = [np.flatnonzero(sq_dists[i] <= eps2) for i in range(n)]
    is_core = np.array([len(nbrs) >= min_samples for nbrs in neighbors], dtype=bool)

    labels = np.full(n, -1, dtype=int)  # -1 = noise/unassigned
    visited = np.zeros(n, dtype=bool)

    cluster_id = 0

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True

        if not is_core[p]:
            continue

        # Start a new cluster
        labels[p] = cluster_id
        seed_set = set(neighbors[p].tolist())
        seed_set.discard(p)

        while seed_set:
            q = seed_set.pop()

            if not visited[q]:
                visited[q] = True
                if is_core[q]:
                    seed_set.update(neighbors[q].tolist())

            # If q is unassigned/noise, assign it to this cluster
            if labels[q] == -1:
                labels[q] = cluster_id

        cluster_id += 1

    return labels


def plot_clusters_2d(X: np.ndarray, labels: np.ndarray, title: str) -> go.Figure:
    X = np.asarray(X)
    labels = np.asarray(labels)

    label_text = np.where(labels == -1, "noise", labels.astype(str))
    fig = px.scatter(
        x=X[:, 0],
        y=X[:, 1],
        color=label_text,
        color_discrete_map={"noise": "lightgray"},
        title=title,
    )
    fig.update_traces(marker=dict(size=6, opacity=0.9))
    fig.update_yaxes(scaleanchor="x", scaleratio=1)
    return fig


eps = 0.25
min_samples = 6

labels_scratch = dbscan_from_scratch(X, eps=eps, min_samples=min_samples)
labels_sklearn = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)

ari = adjusted_rand_score(labels_sklearn, labels_scratch)
print(f"Agreement with sklearn (ARI): {ari:.3f}")

plot_clusters_2d(X, labels_scratch, title="DBSCAN from scratch (NumPy)").show()
Agreement with sklearn (ARI): 1.000

5) Plotly lab: effect of eps and min_samples#

DBSCAN has two knobs that shape everything:

  • eps (ε): how far you consider “nearby”

  • min_samples: how many neighbors you need to be a “crowd”

If you remember one mental model:

  • Increase eps → neighborhoods connect more easily → clusters merge, less noise.

  • Increase min_samples → harder to be core → more noise, clusters can fragment.

def describe_labels(labels: np.ndarray) -> tuple[int, float]:
    labels = np.asarray(labels)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_frac = float(np.mean(labels == -1))
    return n_clusters, noise_frac


eps_values = [0.18, 0.25, 0.35]
min_samples_fixed = 6

for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=min_samples_fixed).fit_predict(X)
    n_clusters, noise_frac = describe_labels(labels)
    title = f"Vary ε: eps={eps}, min_samples={min_samples_fixed} | clusters={n_clusters}, noise={noise_frac:.1%}"
    plot_clusters_2d(X, labels, title=title).show()
eps_fixed = 0.25
min_samples_values = [3, 6, 12]

for m in min_samples_values:
    labels = DBSCAN(eps=eps_fixed, min_samples=m).fit_predict(X)
    n_clusters, noise_frac = describe_labels(labels)
    title = f"Vary min_samples: eps={eps_fixed}, min_samples={m} | clusters={n_clusters}, noise={noise_frac:.1%}"
    plot_clusters_2d(X, labels, title=title).show()

6) Plotly lab: noise detection + choosing eps#

DBSCAN’s noise label is often the feature you want the most.

  • It can act like a simple outlier detector.

  • It can keep a clustering result honest: “these points don’t really belong anywhere.”

A common heuristic for choosing eps is the k-distance plot:

  1. Pick k = min_samples.

  2. For each point, compute the distance to its k-th nearest neighbor.

  3. Sort those distances; look for the “elbow”.

The elbow is roughly where points transition from “in crowds” to “in sparse regions”.

def k_distance_plot(X: np.ndarray, k: int) -> tuple[go.Figure, np.ndarray]:
    # Note: kneighbors on the training set includes each point itself (distance 0),
    # so kth is the distance to the k-th neighbor *counting the point itself*,
    # matching how min_samples counts p in its own neighborhood.
    nn = NearestNeighbors(n_neighbors=k)
    nn.fit(X)
    dists, _ = nn.kneighbors(X)
    kth = np.sort(dists[:, -1])

    fig = px.line(
        y=kth,
        title=f"{k}-distance plot (sorted distances to the {k}th nearest neighbor)",
        labels={"x": "points (sorted)", "y": f"distance to {k}th nearest neighbor"},
    )
    return fig, kth


k = 6
fig, kth = k_distance_plot(X, k=k)
fig.show()
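Reading the elbow off a chart is subjective. One crude programmatic stand-in (an assumption of this sketch, not a standard rule) is to take a high percentile of the sorted k-distances, on the logic that most points should sit in crowds:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

# Rebuild the moons + uniform noise dataset used above.
rng_demo = np.random.default_rng(42)
X_m, _ = make_moons(n_samples=450, noise=0.06, random_state=42)
noise_pts = rng_demo.uniform(X_m.min(axis=0) - 0.6, X_m.max(axis=0) + 0.6, size=(60, 2))
X_all = np.vstack([X_m, noise_pts])

k = 6
dists, _ = NearestNeighbors(n_neighbors=k).fit(X_all).kneighbors(X_all)
kth = np.sort(dists[:, -1])

# Crude stand-in for eyeballing the elbow: a high percentile of the k-distances.
eps_guess = float(np.quantile(kth, 0.90))
print(f"eps guess: {eps_guess:.3f}")
```

Sanity-check any value you get this way against the plot; the percentile is a starting point, not an answer.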
# Noise detection demo: add a few extreme outliers
outliers = rng.uniform(low=[-3.5, -3.5], high=[3.5, 3.5], size=(12, 2))
X_out = np.vstack([X, outliers])

eps = 0.25
min_samples = 6
labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_out)

n_clusters, noise_frac = describe_labels(labels)
title = f"DBSCAN noise detection | eps={eps}, min_samples={min_samples} | clusters={n_clusters}, noise={noise_frac:.1%}"
plot_clusters_2d(X_out, labels, title=title).show()

7) Comparison: DBSCAN vs k-means vs HDBSCAN#

A quick way to remember the tradeoffs:

| Method  | Needs k? | Finds non-convex shapes? | Labels noise? | Handles varying density well? |
|---------|----------|--------------------------|---------------|-------------------------------|
| k-means | yes      | no                       | no            | no                            |
| DBSCAN  | no       | yes                      | yes           | ⚠️ (often struggles)          |
| HDBSCAN | no       | yes                      | yes           | yes                           |

  • k-means: great for roughly spherical clusters; fast; but forces every point into a cluster.

  • DBSCAN: great when clusters are “dense blobs/curves” and you want noise detection.

  • HDBSCAN: like DBSCAN, but it builds a hierarchy over density scales and picks stable clusters — often better when density varies.

# Same dataset, k-means vs DBSCAN
labels_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
plot_clusters_2d(X, labels_kmeans, title="k-means (k=2) on two moons").show()

labels_db = DBSCAN(eps=0.25, min_samples=6).fit_predict(X)
plot_clusters_2d(X, labels_db, title="DBSCAN on two moons (eps=0.25, min_samples=6)").show()
# Optional: HDBSCAN (not installed by default in many environments)
try:
    import hdbscan  # type: ignore
except Exception:
    hdbscan = None

if hdbscan is None:
    print(
        "HDBSCAN not installed. If you want to run this section locally, install with:\n"
        "  pip install hdbscan\n"
        "or (often easier):\n"
        "  conda install -c conda-forge hdbscan"
    )
else:
    clusterer = hdbscan.HDBSCAN(min_cluster_size=20, min_samples=6)
    labels_h = clusterer.fit_predict(X)
    plot_clusters_2d(X, labels_h, title="HDBSCAN on two moons (auto density scale)").show()
HDBSCAN not installed. If you want to run this section locally, install with:
  pip install hdbscan
or (often easier):
  conda install -c conda-forge hdbscan

8) scikit-learn DBSCAN: practical parameter map#

Key parameters in sklearn.cluster.DBSCAN:

  • eps: distance threshold (your most important knob)

  • min_samples: how “crowded” you require a core point to be

  • metric: distance metric ("euclidean" default, but others are available)

  • n_jobs: parallelism for neighbor search (depends on sklearn version)

Practical pattern: scale features before DBSCAN unless you already trust the units.

# Scaling matters: eps is in *distance units*
X_weird = X.copy()
X_weird[:, 0] *= 10.0  # stretch x-axis

labels_unscaled = DBSCAN(eps=0.25, min_samples=6).fit_predict(X_weird)
plot_clusters_2d(X_weird, labels_unscaled, title="DBSCAN on stretched data (no scaling)").show()

X_scaled = StandardScaler().fit_transform(X_weird)
labels_scaled = DBSCAN(eps=0.25, min_samples=6).fit_predict(X_scaled)
plot_clusters_2d(X_scaled, labels_scaled, title="DBSCAN after StandardScaler (distance scale restored)").show()

9) Practical checklist + pitfalls#

  • Scaling is not optional when features have different units (DBSCAN uses distances directly).

  • Choosing eps is the main art:

    • k-distance plots help, but don’t over-trust the elbow.

    • what counts as “near” depends on the metric and the embedding space.

  • Varying density is DBSCAN’s classic failure mode (one ε can’t fit all). Consider HDBSCAN.

  • High dimensions make distance less informative (neighbors become “all the same distance”).

  • Border assignment ambiguity: border points can be assigned to different clusters depending on traversal order.

  • Complexity: naive DBSCAN is O(n²); practical implementations rely on efficient neighbor search.

Exercises#

  1. On the moons dataset, find an eps where DBSCAN merges both moons into one cluster. Explain why.

  2. Create a dataset with one dense blob and one sparse blob. Show where DBSCAN fails and explain how HDBSCAN helps.

  3. Implement a faster neighbor query (KD-tree / ball tree) and compare runtime vs the O(n²) version.

  4. Use DBSCAN on a PCA-reduced real dataset (e.g., digits) and inspect the noise points.

References#

  • Ester, Kriegel, Sander, Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise (DBSCAN). (KDD)

  • scikit-learn docs: DBSCAN (sklearn.cluster.DBSCAN)

  • Campello, Moulavi, Sander (2013). Density-Based Clustering Based on Hierarchical Density Estimates (HDBSCAN). (PAKDD)